ABSTRACT

The Dual Excitation (DE) speech model is applied to the 
problem of speech enhancement. The use of this model and 
its novel decomposition of speech into co-existing voiced and 
unvoiced components allow removal of additive wideband 
noise from the degraded speech with only the knowledge of 
the power spectrum of the noise. The unique properties of 
each component are exploited to improve the performance 
of the enhancement system. Informal comparisons between 
the DE speech enhancement system and a traditional spectral
subtraction algorithm show a clear preference for the
DE enhancement system. 

1. INTRODUCTION 

Degradation caused by additive wideband acoustic noise is
common in many communication systems, where the disturbance
varies from low-level office noise in a normal phone
conversation to high-volume engine noise in a helicopter or
an airplane. In general, the addition of noise reduces
intelligibility and introduces listener fatigue. Consequently, it
is desirable to develop an automated speech enhancement
procedure for removing this type of noise from the speech
signal.

Many different types of speech enhancement systems have
been proposed and tested [1, 2, 3, 4]. The performance of
these systems depends upon the type of noise they are
designed to remove and the information which they require
about the noise. The focus of this work has been on the
removal of wideband noise when only a single signal
consisting of the sum of the speech and the noise is available
for processing.

Due to the complexity of the speech signal and the
limitations inherent in many previous speech models, model-based
speech analysis/synthesis systems are rarely used
for speech enhancement. Typically, model-based speech
enhancement systems introduce artifacts into the speech
which become worse as the signal-to-noise ratio decreases.
As a consequence, most speech enhancement systems to date
have attempted to process the speech waveform directly
without relying on an underlying speech model.

One common speech enhancement method is spectral
subtraction [3]. The basic principle behind this method is
to attenuate frequency components which are likely to have
a low speech-to-noise ratio, while leaving frequency
components which are likely to have a high speech-to-noise
ratio relatively unchanged. Spectral subtraction is generally
considered to be effective at reducing the apparent noise
power in degraded speech. However, this noise reduction is
achieved at the price of reduced speech intelligibility.
Moderate amounts of noise reduction can be achieved without
significant intelligibility loss, but large amounts of noise
reduction can seriously degrade the intelligibility of the
speech. The attenuation characteristics of spectral
subtraction typically lead to a de-emphasis of unvoiced speech
and high-frequency formants. This property is probably one of
the principal reasons for the loss of intelligibility. Other
distortions introduced by spectral subtraction include “tonal
noise”.
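The attenuation principle described above can be sketched in a few lines. The following is a minimal illustration of power spectral subtraction with overlap-add, not the specific algorithm of [3]; the function name, the Hann window, the frame and hop sizes, and the gain floor are all assumptions chosen for the example.

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd, frame_len=256, hop=128, floor=0.01):
    """Illustrative power spectral subtraction (all parameters are assumptions).

    noise_psd holds the expected |Z(w)|^2 of one windowed noise frame,
    one value per rfft bin (length frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        power = np.abs(spec) ** 2
        # Bins with a low estimated speech-to-noise ratio get a small gain;
        # bins with a high ratio are left almost unchanged.
        gain_sq = 1.0 - noise_psd / np.maximum(power, 1e-12)
        gain = np.sqrt(np.maximum(gain_sq, floor))
        enhanced = np.fft.irfft(gain * spec, n=frame_len)
        # Weighted overlap-add with the same window.
        out[start:start + frame_len] += enhanced * window
        norm[start:start + frame_len] += window ** 2
    # The floor on the normalizer avoids amplifying the signal edges.
    return out / np.maximum(norm, 1e-3)
```

Because the gain is computed independently for each bin and frame, isolated bins that happen to exceed the noise estimate survive as short narrowband bursts, which is one common explanation for the tonal artifacts mentioned above.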

In this paper, we introduce a new speech enhancement
system based on the DE speech model which overcomes
some of the aforementioned problems. The DE system is
used to separate speech into voiced and unvoiced
components. Since the acoustic background noise has
characteristics which are similar to unvoiced speech, the unvoiced
component will be principally composed of the unvoiced speech
plus the background noise. The voiced component will be
principally composed of the harmonic components of the
speech signal. As a consequence, speech enhancement can
be achieved through subsequent processing of the unvoiced
component to reduce the apparent noise level. New
processing methods have been derived which take advantage of the
unique properties of the individual components in order to
reduce the distortion introduced into the processed speech.

2. DUAL EXCITATION SPEECH MODEL 

The DE speech model overcomes some of the limitations
of traditional speech models [5]. In traditional speech models,
speech is viewed as the response of a time-varying linear
filter to some excitation sequence, and depending on the
nature of the excitation sequence, speech is modeled as voiced
or unvoiced. In voiced speech, the excitation is modeled
as a periodic impulse sequence, while in unvoiced speech
the excitation is modeled as a white noise sequence. This
speech model, which makes hard voiced/unvoiced decisions,
does not adequately characterize the excitation signal.
Algorithms typically used to estimate the model parameters
and synthesize speech based on this type of speech model
are not sufficiently robust to degradations such as
background noise which may exist in the original speech.

In the DE speech model the speech signal $s_w(n)$ is
separated into two independent components: a voiced
component and an unvoiced component, denoted respectively as
$v_w(n)$ and $u_w(n)$. The subscript $w$ signifies that each term is
a short-time segment which is obtained by application of
a window function $w(n)$. The speech signal $s_w(n)$ can be
expressed in the Fourier domain as

$$S_w(\omega) = V_w(\omega) + U_w(\omega) \qquad (1)$$

where $S_w(\omega)$, $V_w(\omega)$ and $U_w(\omega)$ are the Fourier Transforms
of $s_w(n)$, $v_w(n)$ and $u_w(n)$ respectively.

The voiced component by definition is assumed to be
periodic over the time duration of the window $w(n)$, and thus
the pitch period $P_0$ can be used to form a harmonic series
representation for the voiced portion of each speech
segment. Mathematical expressions for $v_w(n)$ and $V_w(\omega)$ are
given by

$$v_w(n) = \sum_{m=-M}^{M} A_m\, w(n)\, e^{j n m \omega_0} \qquad (2)$$

$$V_w(\omega) = \sum_{m=-M}^{M} A_m\, W(\omega - m\omega_0) \qquad (3)$$

where $W(\omega)$ is the Fourier Transform of the window
function $w(n)$ and is essentially a narrowband lowpass filter.
Thus, $V_w(\omega)$ is the sum of various harmonics of the
fundamental frequency $\omega_0$. The parameter $A_m$ represents the
amplitude of the m’th harmonic. The parameter $\omega_0$
represents the fundamental frequency, which is related to the
pitch period $P_0$ by

$$\omega_0 = \frac{2\pi}{P_0} \qquad (4)$$

The number of harmonics, $M$, is a function of the
fundamental frequency and is given by

$$M = \left\lfloor \frac{\pi}{\omega_0} \right\rfloor \qquad (5)$$

where $\lfloor \cdot \rfloor$ denotes the largest integer less than or equal to
the argument.
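As a concrete reading of the harmonic model, the sketch below evaluates the voiced segment directly from a pitch period, a set of harmonic amplitudes and a window. It assumes the usual relations $\omega_0 = 2\pi/P_0$ and $M = \lfloor \pi/\omega_0 \rfloor$, and it takes the exponent in Eq. (2) with a positive sign so that the result agrees with the shifted-window form of Eq. (3). The function name and argument layout are hypothetical.

```python
import numpy as np

def voiced_segment(pitch_period, amplitudes, window):
    """Evaluate v_w(n) = sum_{m=-M}^{M} A_m w(n) exp(j n m w0).

    amplitudes is ordered m = -M, ..., 0, ..., M (a layout assumed here).
    Returns the segment together with w0 and M.
    """
    w0 = 2.0 * np.pi / pitch_period      # fundamental frequency in rad/sample
    M = int(np.floor(np.pi / w0))        # number of harmonics below pi
    amplitudes = np.asarray(amplitudes, dtype=complex)
    if len(amplitudes) != 2 * M + 1:
        raise ValueError("expected 2*M + 1 harmonic amplitudes")
    n = np.arange(len(window))
    m = np.arange(-M, M + 1)
    basis = np.exp(1j * np.outer(n, m) * w0)   # shape (len(window), 2M + 1)
    return window * (basis @ amplitudes), w0, M
```

When the amplitudes are conjugate-symmetric, that is $A_{-m} = A_m^*$, the returned segment is real up to rounding, as expected for a real speech signal.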

In practice the DE model parameters are not known and
must be estimated from the speech spectrum. The
estimated fundamental frequency, harmonic amplitudes and
voiced spectrum are denoted by $\hat{\omega}_0$, $\hat{A}_m$ and $\hat{V}_w$. The
estimates of the fundamental frequency and the harmonic
amplitudes are obtained with an algorithm developed by
Griffin [6] which minimizes the mean-squared error between the
original speech spectrum $S_w(\omega)$ and the voiced spectrum
$\hat{V}_w(\omega)$. This algorithm ensures that the voiced component
will contain all of the harmonic structure which is in the
original speech. The unvoiced spectrum $U_w(\omega)$ is estimated
from the difference spectrum $D_w(\omega)$ given by

$$D_w(\omega) = S_w(\omega) - \hat{V}_w(\omega) \qquad (6)$$
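To make the estimation step concrete, the sketch below fits the harmonic amplitudes by ordinary least squares for a fundamental frequency that is assumed to be known already, and then forms the difference spectrum. This is a simplification and not the algorithm of [6], which also estimates the fundamental frequency; by Parseval's relation, the time-domain least-squares fit used here minimizes the same squared error as a fit over the full DFT spectrum. The function name is hypothetical.

```python
import numpy as np

def fit_harmonics(s_w, window, w0):
    """Least-squares harmonic fit for a fixed w0 (simplifying assumption).

    s_w is the windowed speech segment. Returns the amplitude estimates
    (ordered m = -M..M), the voiced estimate, and the difference spectrum.
    """
    n = np.arange(len(window))
    M = int(np.floor(np.pi / w0))
    m = np.arange(-M, M + 1)
    # Each column is one windowed complex exponential w(n) exp(j n m w0).
    basis = window[:, None] * np.exp(1j * np.outer(n, m) * w0)
    amps, *_ = np.linalg.lstsq(basis, np.asarray(s_w, dtype=complex), rcond=None)
    v_hat = basis @ amps
    # Difference spectrum: speech spectrum minus estimated voiced spectrum.
    D = np.fft.fft(s_w) - np.fft.fft(v_hat)
    return amps, v_hat, D
```

For a segment that is exactly harmonic at the assumed $\omega_0$, the difference spectrum is zero up to rounding, so everything that remains in $D_w(\omega)$ for real speech is non-harmonic energy.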

There are various approaches for estimating the unvoiced
spectrum $\hat{U}_w(\omega)$ from $D_w(\omega)$ [5]. The approaches exploit
the fact that the fine structure of the unvoiced magnitude
spectrum does not need to be preserved, thereby allowing
different types of smoothing on the spectral magnitude of $D_w(\omega)$.
The use of smoothing on the unvoiced component reduces
the effects of noise on the estimate of the unvoiced
magnitude spectrum. The phase of the unvoiced component is
obtained either from the phase of the difference spectrum or
from the phase of a reference noise signal. Figure 1 shows
the speech passage “took” spoken by a male speaker. This
passage was decomposed by the DE speech system into the
voiced and unvoiced components which are shown in
Figures 2(a) and (b) respectively.

Figure 1: Noisy speech passage “took”

Figure 2: (a) Voiced and (b) unvoiced components of the
noisy passage “took”
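One simple way to realize the smoothing described above is a moving average over frequency applied to the power of the difference spectrum, with the phase taken from the difference spectrum itself. The sketch below illustrates that single option; the choice of a rectangular moving average, the circular boundary handling and the name `smooth_bins` are assumptions, and [5] describes other approaches.

```python
import numpy as np

def estimate_unvoiced(D, smooth_bins=9):
    """Smooth |D_w(w)|^2 across frequency and reuse the phase of D_w(w).

    D is a full-length DFT; smooth_bins must be odd and at least 3.
    """
    if smooth_bins < 3 or smooth_bins % 2 == 0:
        raise ValueError("smooth_bins must be odd and at least 3")
    power = np.abs(D) ** 2
    pad = smooth_bins // 2
    # Wrap around the ends so the periodicity of the DFT is respected.
    padded = np.concatenate([power[-pad:], power, power[:pad]])
    kernel = np.ones(smooth_bins) / smooth_bins
    smoothed = np.convolve(padded, kernel, mode="valid")
    return np.sqrt(smoothed) * np.exp(1j * np.angle(D))
```

A circular moving average leaves the total energy of the spectrum unchanged, so the smoothing redistributes energy between neighbouring bins without changing the overall level of the unvoiced component.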

3. NEW SPEECH ENHANCEMENT METHOD 

A schematic of the DE speech enhancement system is shown
in Figure 3. The system enhances the voiced and the
unvoiced parts separately. Enhancement of the voiced
component requires only a minor modification to account for the
presence of the noise in the harmonic amplitudes; most of
the noise reduction in the DE speech enhancement system
is performed by processing the unvoiced component.

The unvoiced spectrum does not contain any harmonic
structure, so it may be smoothed without introducing
substantial distortion into the speech.
The benefit of smoothing the unvoiced spectrum is a better
estimate of the power spectrum of the unvoiced component.
The quality of this estimate is vital in subsequent spectral
subtraction.

Figure 3: Dual Excitation Speech Enhancement System

3.1. Enhancement of the voiced component

In general, the presence of noise in speech results in noisy
parameter estimates. Since the voiced component is
restricted to the subspace defined by a harmonic series, the
noise in the voiced parameters is restricted to the same
subspace. This causes the noise in the voiced parameters to be
perceived as harmonic noise in the synthesized speech
signal. There are essentially two parameters characterizing the
voiced component: the fundamental frequency and the
harmonic amplitudes. The fundamental frequency estimation
error is assumed to be negligible [5]. Thus, the enhancement
of the voiced component entails only the modification of the
harmonic amplitudes to reduce the leakage of the noise into
the voiced component; the noisy estimate of the harmonic
amplitude, $\hat{A}_m$, has to be adjusted to account for the
leakage noise. The estimate of the m’th harmonic amplitude,
$\hat{A}_m$, is eliminated if the effective noise at the corresponding
frequency is greater than the estimate of the harmonic
amplitude. $\tilde{A}_m$, which denotes the enhanced version of $\hat{A}_m$, is
given by

$$\tilde{A}_m = \begin{cases} 0 & \text{if } |\hat{A}_m| < [\text{threshold missing in source}] \\ \hat{A}_m & \text{otherwise} \end{cases}$$

where $P_{zz}(\omega)$ represents the noise power density, and $N_e$,
which represents the effect of the window, is defined as

[equation missing in source]

Elimination of this leakage also causes a loss of harmonic
energy of the actual voiced speech. However, this loss of
harmonic energy is generally not perceived due to the low
local speech-to-noise ratio in this harmonic band. In order
to recapture the energy which has been removed from the
voiced component, it is necessary to modify the difference
spectrum to account for the above equation.

3.2. Enhancement of the unvoiced component

The goal here is to remove the noise, $z(n)$, from the unvoiced
component as much as possible. This is achieved by a
two-pass enhancement system [5]. In the first pass, harmonic
bands where the voiced energy is substantially greater than
the unvoiced energy are identified. In these regions the
unvoiced energy is masked by the voiced energy, and hence
the unvoiced energy can be eliminated without altering the
perceived speech. The enhanced version of the difference
spectrum, denoted as $\tilde{D}_w(\omega)$, is given by

$$\tilde{D}_w(\omega) = \begin{cases} 0 & \text{if } E_{v_m} > 3\,E_{uv_m} \\ D_w(\omega) & \text{otherwise} \end{cases}$$

where $E_{v_m}$ and $E_{uv_m}$ are the energies in the m’th harmonic
band of the voiced and unvoiced components respectively.
In the second pass, where most of the noise reduction is
performed, $d(n)$ is synthesized from $\tilde{D}_w(\omega)$, and then passed
through a modified Wiener filter $H_{w_s}(\omega)$. The Wiener
filter removes the background noise from the unvoiced speech
in the regions of the spectrum which have a low
speech-to-noise ratio. The mathematical representation of the modified
Wiener filter is given by

[equation missing in source; only the label “otherwise” of its second case survives]

where $E[|Z_{w_s}(\omega)|^2]$ is the power spectrum of the noise
and $E[|D_{w_s}(\omega)|^2]$ is the smoothed unvoiced spectrum.
The subscript $w_s$ signifies that each term is a short-time
segment which is obtained by application of a window function
$w_s(n)$. Typical values for $\alpha$ and $\beta$ are 1.6 and 0.1
respectively.
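The three decision rules in this section (dropping weak harmonics, masking unvoiced energy under strong harmonics, and attenuating low speech-to-noise regions) can be illustrated as follows. This is only a sketch under stated assumptions: `noise_level` is a hypothetical stand-in for the effective-noise threshold, whose exact expression in terms of $P_{zz}$ and $N_e$ is not reproduced here; `band_index` is an assumed lookup from DFT bin to harmonic band; and the gain in `wiener_like_gain` is a generic over-subtraction form with a floor, chosen because it uses two parameters like $\alpha$ and $\beta$, not the modified Wiener filter of the text.

```python
import numpy as np

def prune_harmonics(amps, noise_level):
    """Zero harmonic amplitudes whose magnitude is below noise_level.

    noise_level is a hypothetical per-harmonic (or scalar) threshold.
    """
    amps = np.asarray(amps, dtype=complex)
    return np.where(np.abs(amps) < noise_level, 0.0, amps)

def mask_harmonic_bands(D, band_index, E_v, E_uv, ratio=3.0):
    """First pass: zero D in bands where voiced energy dominates.

    band_index[k] is the harmonic band m that DFT bin k belongs to.
    E_v[m] and E_uv[m] are the voiced and unvoiced band energies.
    """
    dominated = np.asarray(E_v) > ratio * np.asarray(E_uv)
    return np.where(dominated[np.asarray(band_index)], 0.0, D)

def wiener_like_gain(unvoiced_psd, noise_psd, alpha=1.6, beta=0.1):
    """Second pass (assumed form): over-subtract the noise, floor the gain."""
    unvoiced_psd = np.asarray(unvoiced_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    gain = 1.0 - alpha * noise_psd / np.maximum(unvoiced_psd, 1e-12)
    return np.maximum(gain, beta)
```

In this sketch a bin whose smoothed unvoiced power is ten times the noise power keeps a gain of 0.84, while a bin at the noise level is held at the floor of 0.1 rather than being removed completely.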
4. EXPERIMENTAL RESULTS

The DE speech enhancement system described above has
been tested on a number of noisy speech passages. These
passages have been generated by adding white Gaussian
noise with known variance to a passage of clean speech. The
signal-to-noise ratio of these passages varied between 10 and 30
dB. Each noisy passage was processed using the DE speech
enhancement system described above. The same passage
was then processed using a traditional spectral subtraction
speech enhancement system. The performance of these two
systems was compared through informal listening. These
comparisons indicated that the quality of the DE speech
enhancement system was superior to that of the spectral
subtraction system. There were clearly fewer artifacts in
the speech processed by the DE speech enhancement
system. Specifically, the tonal noise common to spectral
subtraction approaches was virtually eliminated. In addition,
the DE speech enhancement method was perceived as
providing more noise reduction than the spectral subtraction
method.
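Test material of the kind described above can be generated with a few lines of code. The sketch below adds white Gaussian noise of known variance to a clean passage at a requested signal-to-noise ratio; it assumes the ratio is defined from average power over the whole passage, which the text does not specify, and the function name is hypothetical.

```python
import numpy as np

def add_white_noise(clean, snr_db, rng=None):
    """Return (noisy, noise_variance) for a target SNR in dB.

    The SNR is taken as the ratio of mean signal power to noise variance
    over the whole passage (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_var = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_var), size=clean.shape)
    return clean + noise, noise_var
```

Because the noise is white, its power spectrum is flat at the returned variance, which is the only knowledge of the noise that the enhancement system is said to require.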

In order to study the effects of speech enhancement
for hearing impaired listeners, a study was conducted by
Dr. William Rabinowitz at M.I.T.’s Research Laboratory
of Electronics. Degraded and processed speech was
presented to a hearing impaired listener, and the intelligibility
was evaluated for each set of material. The male speech
that was processed by the DE speech enhancement system
showed a 15 percent increase in intelligibility compared to
the degraded speech. Similarly, the processed female speech
showed a 23 percent increase in intelligibility compared to
the degraded speech.
[5] J. Hardwick, The Dual Excitation Speech Model. PhD
thesis, MIT, E.E.C.S. Department, June 1992.

gorithm,” Proc. of the Int. Conf. on Acoustics, Speech
and Signal Proc., vol. 67, pp. 592-601, March 1984.


5. CONCLUSIONS 

The DE speech enhancement system and its evaluation have
been presented in this paper. Based on informal listening
tests, this system outperformed the traditional spectral
subtraction system. Although the amount of noise reduction in
the two systems was similar, the DE system did not contain
the tonal artifacts which were present in the spectral
subtraction system. Preliminary evidence has shown that the
DE speech enhancement system may be able to improve the
intelligibility of noisy speech for hearing impaired listeners.